The dataset explored in this analysis contains quality ratings for several white wines. Each instance contains a quality rating for the wine between 0 (very bad) and 10 (very excellent) and also provides information about various chemical properties of the wine.
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
It can be seen that the dataset contains 4898 instances and 12 variables. Eleven of them describe the chemical properties of the wine, and the variable ‘quality’ provides a rating for the wine.
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
The distribution of the ‘fixed.acidity’ variable looks approximately normal. Most of the wines have a fixed.acidity value between 6 and 7.5, and very few wines have a fixed acidity less than 5 or more than 9.
From the above histogram it can be noticed that the value of ‘volatile.acidity’ is a decimal, usually between 0 and 1. Most of the wines have a volatile.acidity value between about 0.2 and 0.3, and very few wines have a volatile.acidity value of more than 0.5.
About half of the wines have a citric acid amount of roughly 0.3 or less (the median is 0.32), and about 75% of the wines have a citric acid amount less than 0.4. There are also a few outliers where the amount of citric acid is more than 1.
The distribution of the ‘residual.sugar’ variable is right-skewed, with almost half of the wines having a residual.sugar value between 0 and 5. About 75% of the wines have a residual.sugar value less than 10, and there are a few outliers whose residual.sugar value is more than 20.
I transformed the skewed distribution of ‘residual.sugar’ using log10 to achieve a better distribution. The transformed distribution looks bimodal, with peaks at around 3 and 9.
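The log10 step can be sketched in a few lines of base R. This is a minimal illustration rather than the code behind the report: it uses only the first ten residual.sugar values printed in the str() listing, and the data frame name `wines` mentioned in the comment is an assumption.

```r
# First ten residual.sugar values, copied from the str() listing above.
residual_sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

# log10 compresses the long right tail: 1.5 and 20.7 are about 19 units
# apart on the raw scale but only about 1.14 apart on the log10 scale.
log_sugar <- log10(residual_sugar)
print(round(log_sugar, 3))

# Bin the transformed values. plot = FALSE only returns the bin counts;
# dropping it draws the histogram. On the full data this would be
# something like hist(log10(wines$residual.sugar)), where `wines` is an
# assumed name for the data frame.
bins <- hist(log_sugar, breaks = 10, plot = FALSE)
print(bins$counts)

# Values read off a log10 axis are converted back to g/L with 10^x.
back_transformed <- 10^log_sugar
```

Peak positions such as 3 and 9 are easier to interpret in the original units, which is why the back-transformation is included.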
The ‘chlorides’ variable represents the amount of salt in the wine. Most of the wines have a chlorides amount of around 0.04, and very few wines have a chlorides amount of more than 0.06.
The above histogram shows that most of the wines have a free sulfur dioxide level between 20 and 40. About 75% of the wines have free sulfur dioxide levels between 2 and 45. Few wines have a free sulfur dioxide level of more than 60.
The amount of total sulfur dioxide in the wines has a wider range than free sulfur dioxide, as it is the sum of free and bound sulfur dioxide. The distribution looks approximately normal, with most of the wines having a total sulfur dioxide level between 100 and 160. The average amount of total sulfur dioxide in the wines is 138.4 mg/L.
The lowest alcohol content in a wine is 8 and the highest is around 14. Most of the wines have an alcohol content between 9 and 10.5. About 75% of the wines have an alcohol content between 8 and 11.5.
Most of the wines received a quality rating from 5 to 7. There are very few wines with a quality rating more than 7 and there are no wines which received a rating of 10. It can also be seen that the lowest quality rating received is 3.
The dataset contains 4898 instances of quality ratings received by various white wines. Each instance also contains other variables which represent the chemical factors of the wine. These variables are ‘fixed.acidity’, ‘volatile.acidity’, ‘citric.acid’, ‘residual.sugar’, ‘chlorides’, ‘free.sulfur.dioxide’, ‘total.sulfur.dioxide’, ‘density’, ‘pH’, ‘sulphates’, ‘alcohol’. All the variables other than quality are represented by decimal values.
At least 75% of the wines received a rating of 6 or lower (both the median and the third quartile of quality are 6), so at most 25% of the wines received a rating greater than 6. Most of the wines have alcohol content between 9 and 11.
The main features of interest in the dataset are alcohol and quality. I would like to determine whether the amount of alcohol affects the rating of a wine.
The other features that could support the investigation are residual.sugar, density and total sulfur dioxide. The residual sugar, which sweetens the wine, could have affected the rating of the wine. The amount of total sulfur dioxide is also important as it prevents the oxidation of the wine. Excess oxidation causes the wine to degrade and lose its aroma and taste.
The amount of sugar, represented by the residual.sugar variable, had a skewed distribution. Applying a log10 transformation changed the distribution to bimodal, with peaks at 3 and 9.
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.02269729 0.289180698
## volatile.acidity -0.02269729 1.00000000 -0.149471811
## citric.acid 0.28918070 -0.14947181 1.000000000
## residual.sugar 0.08902070 0.06428606 0.094211624
## chlorides 0.02308564 0.07051157 0.114364448
## free.sulfur.dioxide -0.04939586 -0.09701194 0.094077221
## total.sulfur.dioxide 0.09106976 0.08926050 0.121130798
## density 0.26533101 0.02711385 0.149502571
## pH -0.42585829 -0.03191537 -0.163748211
## sulphates -0.01714299 -0.03572815 0.062330940
## alcohol -0.12088112 0.06771794 -0.075728730
## quality -0.11366283 -0.19472297 -0.009209091
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.08902070 0.02308564 -0.0493958591
## volatile.acidity 0.06428606 0.07051157 -0.0970119393
## citric.acid 0.09421162 0.11436445 0.0940772210
## residual.sugar 1.00000000 0.08868454 0.2990983537
## chlorides 0.08868454 1.00000000 0.1013923521
## free.sulfur.dioxide 0.29909835 0.10139235 1.0000000000
## total.sulfur.dioxide 0.40143931 0.19891030 0.6155009650
## density 0.83896645 0.25721132 0.2942104109
## pH -0.19413345 -0.09043946 -0.0006177961
## sulphates -0.02666437 0.01676288 0.0592172458
## alcohol -0.45063122 -0.36018871 -0.2501039415
## quality -0.09757683 -0.20993441 0.0081580671
## total.sulfur.dioxide density pH
## fixed.acidity 0.091069756 0.26533101 -0.4258582910
## volatile.acidity 0.089260504 0.02711385 -0.0319153683
## citric.acid 0.121130798 0.14950257 -0.1637482114
## residual.sugar 0.401439311 0.83896645 -0.1941334540
## chlorides 0.198910300 0.25721132 -0.0904394560
## free.sulfur.dioxide 0.615500965 0.29421041 -0.0006177961
## total.sulfur.dioxide 1.000000000 0.52988132 0.0023209718
## density 0.529881324 1.00000000 -0.0935914935
## pH 0.002320972 -0.09359149 1.0000000000
## sulphates 0.134562367 0.07449315 0.1559514973
## alcohol -0.448892102 -0.78013762 0.1214320987
## quality -0.174737218 -0.30712331 0.0994272457
## sulphates alcohol quality
## fixed.acidity -0.01714299 -0.12088112 -0.113662831
## volatile.acidity -0.03572815 0.06771794 -0.194722969
## citric.acid 0.06233094 -0.07572873 -0.009209091
## residual.sugar -0.02666437 -0.45063122 -0.097576829
## chlorides 0.01676288 -0.36018871 -0.209934411
## free.sulfur.dioxide 0.05921725 -0.25010394 0.008158067
## total.sulfur.dioxide 0.13456237 -0.44889210 -0.174737218
## density 0.07449315 -0.78013762 -0.307123313
## pH 0.15595150 0.12143210 0.099427246
## sulphates 1.00000000 -0.01743277 0.053677877
## alcohol -0.01743277 1.00000000 0.435574715
## quality 0.05367788 0.43557472 1.000000000
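A matrix like the one above is what `cor()` returns for a numeric data frame. The sketch below is a small illustration under stated assumptions: it uses only the first ten rows of three columns as printed in the str() listing, so its coefficients will not match the full-data values, and the name `wines_head` is made up for the example.

```r
# First ten rows of three columns, copied from the str() listing above.
wines_head <- data.frame(
  residual.sugar       = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129),
  alcohol              = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
)

# Pearson correlation between every pair of columns.
cor_matrix <- cor(wines_head, method = "pearson")
print(round(cor_matrix, 2))

# The matrix is symmetric with ones on the diagonal, so only the upper
# triangle carries information. List those pairs by absolute correlation.
upper <- which(upper.tri(cor_matrix), arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(cor_matrix)[upper[, "row"]],
  var2 = colnames(cor_matrix)[upper[, "col"]],
  r    = cor_matrix[upper]
)
pairs <- pairs[order(-abs(pairs$r)), ]
print(pairs)
```

Sorting the pairs by absolute value makes it easier to spot the strongest relationships, such as residual.sugar with density (0.84) and alcohol with density (-0.78) in the full matrix.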
The above plot indicates that the wines with higher quality have higher amounts of alcohol. The wines which received a lower quality rating from 3 to 5 have an average alcohol amount of around 10%, whereas the wines with higher quality ratings have an average alcohol amount of more than 11%.
It can be noticed that the wines with higher quality have lower density than the wines with lower quality. The density of almost all the wines is within the range 0.985 to 1.
The amount of residual sugar appears to be similar across all the quality ratings of the wine.
The amount of total sulfur dioxide does not seem to have any relationship with the quality of the wine. The wines of all quality ratings have total sulfur dioxide amounts in the same range, between 50 and 250.
The relationship between residual sugar and total sulfur dioxide is not very strong. There are a large number of wines with a residual sugar amount less than 5 and a total sulfur dioxide amount between 50 and 200. As the amount of residual sugar in wine increases, there is a slight increase in the amount of total sulfur dioxide as well.
The relationship between residual.sugar and density looks almost linear. The wines that have higher amounts of residual sugar also have a higher density than the wines with lower amounts of residual sugar.
There is no strong relationship between residual.sugar and alcohol. However, as the amount of alcohol in the wine increases, the variance in the amount of residual sugar decreases. The wines with the highest alcohol amounts have a residual sugar amount of around 10 g/L, whereas some wines with lower alcohol amounts have a residual sugar amount of 20 g/L.
The above plot reveals a negative linear relationship between alcohol and density (the correlation coefficient is about -0.78). It can be seen that the wines with high amounts of alcohol have lower density than the wines with low amounts of alcohol.
The quality rating received by a wine did not have a strong relationship with any other variables. Quality of a wine was compared with variables like alcohol, density, residual.sugar and total.sulfur.dioxide. Almost all the wines had the same range of values for these other variables.
To find how the mean and median values of these variables compared to the quality of wine, I have created a new variable named “quality_class”. The wines with quality rating less than or equal to 4 belong to the class “low”, the wines with quality rating 5 or 6 belong to the class “medium” and the wines with quality rating higher than 6 belong to the class “high”.
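One way to build such a variable in R is with `cut()`. This is a minimal sketch of the binning rule described above, not necessarily the code used for the report; the `quality` vector here is a stand-in that covers the observed range of ratings (3 to 9).

```r
# Quality scores spanning the range observed in the data (3 to 9).
quality <- c(3L, 4L, 5L, 6L, 7L, 8L, 9L)

# cut() uses right-closed intervals by default:
# (-Inf, 4] -> low, (4, 6] -> medium, (6, Inf] -> high.
quality_class <- cut(
  quality,
  breaks = c(-Inf, 4, 6, Inf),
  labels = c("low", "medium", "high"),
  ordered_result = TRUE
)

# Show each score next to the class it falls into.
print(data.frame(quality, quality_class))
```

With the class stored as a column of the full data frame, per-class means could then be computed with something like `aggregate(cbind(alcohol, density, residual.sugar) ~ quality_class, data = wines, FUN = mean)`, where `wines` is an assumed name for the data frame.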
It was observed that the wines with higher quality had higher amounts of alcohol and lower density than the wines with lower quality. The amount of residual sugar was almost the same in all the quality classes of wine.
The other features explored were residual.sugar, density and alcohol. There is no strong relationship between residual.sugar and alcohol but density has strong relationships with both residual sugar and alcohol. There is a positive linear relationship between residual sugar and density and there is a negative linear relationship between alcohol and density.
The strongest relationships found were between residual sugar and density and between alcohol and density.
The wines that received a higher quality rating appear to have lower density at lower amounts of residual sugar when compared to the wines with lower quality ratings.
As already noted, wines with higher quality ratings have lower density and higher amounts of alcohol, and this plot shows the same pattern. It can be noticed that up to 10% alcohol by volume, the wines with high quality have higher density, but beyond that these wines have the lowest density of all the wines.
As the previous plots indicated, the wines with higher quality ratings have higher amounts of alcohol at lower amounts of residual sugar. It can also be noticed that the amount of alcohol in the wines with medium quality ratings increases sharply at higher amounts of residual sugar.
This part of the analysis did not reveal any new features or patterns; instead it strengthened the patterns found in the previous parts of the analysis. We have already seen that wines with high quality have higher amounts of alcohol and lower density. Surprisingly, the plot of density vs alcohol shows that wines with high quality have higher density than wines with lower quality up to 10% alcohol by volume. Above 10% alcohol by volume, the density of the high quality wines is lower than that of the lower quality wines.
This plot indicates that wines with higher quality have higher amounts of alcohol than the wines with lower quality. Even though the difference in the mean amount of alcohol is not very big, it is still considerable.
This plot shows a linear relationship between the amount of residual sugar and the density of the wine. The reason for this linear relationship could be that the density of a wine depends largely on its alcohol percentage and its sugar content.
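One way to probe this explanation, which the analysis above does not do, would be to regress density on residual sugar and alcohol. The sketch below is only an illustration under stated assumptions: `fit_density` is a hypothetical helper, and the five rows come from the str() listing (which prints rounded density for only five wines), so the printed coefficients are not meaningful estimates.

```r
# First five rows, copied from the str() listing above. Density is rounded
# there, and five rows are far too few for a meaningful fit.
wines_head <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9),
  density        = c(1.001, 0.994, 0.995, 0.996, 0.996)
)

# Hypothetical helper (not part of the original analysis): ordinary least
# squares fit of density on residual sugar and alcohol.
fit_density <- function(wines) {
  lm(density ~ residual.sugar + alcohol, data = wines)
}

fit <- fit_density(wines_head)

# Intercept plus one slope per predictor.
print(coef(fit))

# Share of the variance in density explained by the two predictors.
print(summary(fit)$r.squared)
```

On the full dataset, a positive slope for residual.sugar and a negative slope for alcohol would be consistent with the correlations reported earlier (0.84 and -0.78).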
The above two plots look interesting because the findings prior to this plot indicated that the wines in the higher quality class usually have a slightly higher amount of alcohol and lower density. But this plot shows that, for a particular alcohol content, the density of the wines with a higher quality rating is higher than that of the wines with a lower quality rating. This trend is only observed at alcohol amounts between 8 and 10. At alcohol amounts greater than 10, the wines with a higher quality rating have lower density.
The white wine dataset contains 4898 instances of quality ratings received by various wines. Each instance also contains 11 other variables based on physicochemical tests. I started the EDA of this dataset by looking at the distribution of these individual variables. All the variables except quality had continuous (decimal) values, while quality is an integer score. The distribution of the residual.sugar variable was skewed, and I transformed the variable using a log10 transformation.
Most of the variables did not have strong correlation coefficients, and the only variable pairs which had some reasonable correlation were residual.sugar & alcohol, residual.sugar & density, density & alcohol, and quality & alcohol. Exploring these pairs did not reveal any strong relationships apart from residual.sugar vs density and density vs alcohol, which had linear relationships. This lack of correlation among variables caused trouble, as the plots did not reveal any clear relationships to explore further.
To determine how wines with similar quality rating compared to each other, I created a new variable called “quality_class” with values “low”, “medium” and “high”. Exploring these classes indicated that the wines that belonged to the ‘high’ quality class had lower density and high amounts of alcohol when compared to the other lower classes of wines. The amount of residual sugar was almost the same in the wines of all the classes. Dividing the wines into these classes and exploring them turned out to be a success as it helped in observing the trends in alcohol and density among the wines.
The exploratory analysis performed so far did not result in any strong relationships that could be used for building a predictive model. The reason for this could be the lack of important characteristics of wine in the dataset. The analysis could be improved if the dataset also had other characteristics such as the type of soil in which the grapes used were grown, the species of grapes used, the climate in which the wine was made etc.